Developer Home Contents Search Feedback Support Intel(r)

Application Note

Using MMX™ Instructions to Convert 24-Bit
True Color Data to 16-Bit High Color

Disclaimer
Information in this document is provided in connection with Intel products. No license, express or implied, by estoppel or otherwise, to any intellectual property rights is granted by this document. Except as provided in Intel's Terms and Conditions of Sale for such products, Intel assumes no liability whatsoever, and Intel disclaims any express or implied warranty, relating to sale and/or use of Intel products including liability or warranties relating to fitness for a particular purpose, merchantability, or infringement of any patent, copyright or other intellectual property right. Intel products are not intended for use in medical, life saving, or life sustaining applications. Intel may make changes to specifications and product descriptions at any time, without notice.

Copyright © Intel Corporation (1996). Third-party brands and names are the property of their respective owners.

1.0. INTRODUCTION

2.0. RGB32TO16 FUNCTIONS

  • 2.1. Mask-Shift-Or Algorithm
  • 2.2. Multiply-Add Algorithm

    APPENDIX A: Mask-Shift-Or Algorithm

    APPENDIX B: PMADD Algorithm

  • 1.0. INTRODUCTION

    The Intel Architecture (IA) media extensions include single-instruction, multi-data (SIMD) instructions. This application note presents examples of code that exploit these instructions. Two RGB24to16 functions are examined that use the new MMX instructions PSRLD, PXOR, PACKSS, and PMADD to complete the conversion. The performance improvement relative to traditional IA code can be attributed primarily to the much faster shift instructions. Whereas the IA shift instruction (SHIFT) takes four cycles on a Pentium® processor, the MMX shift instruction (PSHIFT) takes only one cycle. Also, the new PMADD instruction combines a multiply-and-add function which has a throughput of one and a latency of three cycles.

    2.0. RGB32TO16 FUNCTIONS

    Many applications provide RGB data with the assumption that the video display will use 24-bit true color data. This is especially true of three-dimensional (3D) applications that perform calculations to generate true color values for the pixels to be displayed. Each byte contains one byte of Red, Green, or Blue data. There continues to be, however, an abundance of video displays that require 16-bit high color data. Visually, the conversion to RGB555 is straightforward. There are other pixel formats, such as RGB565, to which the techniques of this paper can be easily applied.

    Figure 1. 24-bit True Color to 16-bit High Color Conversion

    In the sections that follow, two conversion methods will be discussed. The first method, which will be referred to as the Mask-Shift-Or method, is the algorithm that is traditionally used to complete the conversion. This method uses very few MMX registers, making it possible to interleave the processing of pixels to efficiently use the processor's clock cycles.

    The second method, which will be referred to as the PMADD method, makes use of the multiply-and-add instruction to shift the bits into their appropriate positions. This method uses the multiply instruction to shift the data and the add instruction to combine the bits together. Even though the PMADD instruction has a latency of three cycles, it is a pipelined instruction that can be executed in either the U or V pipe, allowing other instructions to be paired with it. As illustrated in Section 2.2, this method takes advantage of this and processes the pixels even more efficiently than the Mask-Shift-Or algorithm.

    Mask-Shift-Or Algorithm

    The Mask-Shift-Or algorithm takes each 24-bit true color element, stored in the three least significant bytes of a double word, masks each 8-bit color, shifts it right three bits, and ORs the result into a register. Using the 64-bit MMX registers and instructions, two pixels can be processed at the same time.

    Example 1. Mask-Shift-Or Basic Algorithm

    
    	1	movq	mm0, [eax]		;get two 24-bit pixels
    	2	movq	mm1, mm0		;save the original data
    	3	pand	mm0, BLUES		;mask out all but the 5 MSB blue bits
    	4	psrld	mm0, 3			;shift blue bits to bits 0-4
    	5	movq	mm2, mm1		;save the original data again
    	6	pand 	mm1, GREENS		;mask out all but the 5 MSB green bits
    	7	psrld	mm1, 6			;shift green bits to bits 5-9
    	8	por	mm0, mm1		;or in the green bits with the blue bits
    	9	pand	mm2, REDS		;mask out all but the 5 MSB red bits
    	10	psrld	mm2, 9			;shift red bits to bits 10-14
    	11	por	mm0, mm2		;or in the red bits
    						;mm0 now contains 2 16-bit color elements
    						;in the low word of each DWORD
    	12	packssdw mm0, ZERO		;pack the 2 16-bit elements into low DWORD
    

    The code example shown in Example 1 does little to take advantage of instruction pairing and uses eleven cycles to process two pixels. It also leaves the upper DWORD of the result register unused. By processing another two pixels and pairing the instructions to make the best use of the U and V pipes, four pixels can be processed in thirteen cycles or 3.25 cycles per pixel. The comments indicate which pair of pixels is being affected and the instruction step as illustrated in Example 1.

    Example 2. Mask-Shift-Or Paired Algorithm

    
    1	movq		mm0, [eax]	;(A1)	get two 24-bit pixels
    2	movq		mm3, 8[eax]	;(B1)	get next two 24-bit pixels
    3	movq		mm1, mm0	;(A2)	save the original first two
    4	pand		mm0, BLUES	;(A3)	mask out all but the 5 MSB blue bits
    5	movq		mm4, mm3	;(B2)	save the original second two
    6	pand		mm3, BLUES	;(B3)	mask out all but the 5 MSB blue bits
    7	psrld		mm0, 3		;(A4)	shift blue bits to bits 0-4
    8	psrld		mm3, 3		;(B4)	shift blue bits to bits 0-4
    9	movq		mm2, mm1	;(A5)	save the original data again
    10	pand		mm1, GREENS	;(A6)	mask out all but the 5 MSB green bits
    11	movq		mm6, mm4	;(B5)	save the original data again
    12	pand		mm4, GREENS	;(B6)	mask out all but the 5 MSB green bits
    13	psrld		mm1, 6		;(A7)	shift green bits to bits 5-9
    14	por		mm0, mm1	;(A8)	or in the green bits with the 
    						;	blue bits
    15	psrld		mm4, 6		;(B7)	shift green bits to bits 5-9
    16	pand		mm2, REDS	;(A9)	mask out all but the 5 MSB red bits
    17	por		mm3, mm4	;(B8)	or in the green bits with the 
    						;	blue bits
    18	pand		mm6, REDS	;(B9)	mask out all but the 5 MSB red bits
    19	psrld		mm2, 9		;(A10)	shift red bits to bits 10-14
    20	por		mm0, mm2	;(A11)	or in the red bits
    21	psrld		mm6, 9		;(B10)	shift red bits to bits 10-14
    22	por		mm3, mm6	;(B11)	or in the red bits
    23	packssdw	mm0, mm3	;(AB12) pack the 4 16-bit pixels into 
    					;	one qword
    

    It would have been more straightforward to simply dedicate the U pipe to the A pair of pixels and the V pipe to the B pair. However, memory accesses can only be done in the U pipe making it necessary to mix the use of the two pipes between the two pairs of pixels.

    Two instructions still remain unpaired at the end of the paired algorithm. However, because each pair of pixels is converted using only three MMX registers, four more pixels can be processed if the instructions are interleaved allowing more pairing to be done. This algorithm is completed in twenty-four cycles for each set of eight pixels or three cycles per pixel.

    Example 3. Mask-Shift-Or Interleaved Algorithm

    
    1	movq		mm0, [eax]	;(A1)	get two 24-bit color elements
    2	movq		mm3, 16[eax]	;(C1)	get third two 24-bit pixels
    3	movq		mm1, mm0	;(A2)	save the original first two
    4	pand		mm0, BLUES	;(A3)	mask out all but the 5 MSB blue bits
    5	movq		mm4, mm3	;(C2)	save the original second two
    6	pand		mm3, BLUES	;(C3)	mask out all but the 5 MSB blue bits
    7	psrld		mm0, 3		;(A4)	shift blue bits to bits 0-4
    8	psrld		mm3, 3		;(C4)	shift blue bits to bits 0-4
    9	movq		mm2, mm1	;(A5)	save the original data again
    10	pand 		mm1, GREENS	;(A6)	mask out all but the 5 MSB 
    					;	green bits
    11	movq		mm6, mm4	;(C5)	save the original data again
    12	pand		mm4, GREENS	;(C6)	mask out all but the 5 MSB 
    					;	green bits
    13	psrld		mm1, 6		;(A7)	shift green bits to bits 5-9
    14	por		mm0, mm1	;(A8)	or in the green bits with the blue
    15	psrld		mm4, 6		;(C7)	shift green bits to bits 5-9
    16	pand		mm2, REDS	;(A9)	mask out all but the 5 MSB red bits
    17	por		mm3, mm4	;(C8)	or in the green bits with the blue
    18	pand		mm6, REDS	;(C9)	mask out all but the 5 MSB red bits
    19	psrld		mm2, 9		;(A10)	shift red bits to bits 10-14
    20	por		mm0, mm2	;(A11)	or in the red bits
    21	psrld		mm6, 9		;(C10)	shift red bits to bits 10-14
    22	movq		mm1, 24[eax]	;(D1)	get fourth two 24-bit pixels
    23	por		mm6, mm3	;(C11)	or in the red bits
    24	movq		mm3, 8[eax]	;(B1)	get second two 24-bit pixels
    25	movq		mm2, mm1	;(D2)	save the original data
    26	pand		mm1, BLUES	;(D3)	mask out all but the 5 MSB blue bits
    27	movq		mm4, mm3	;(B2)	save the original data
    28	pand		mm3, BLUES	;(B3)	mask out all but the 5 MSB blue bits
    29	psrld		mm1, 3		;(D4)	shift blue bits to bits 0-4
    30	psrld		mm3, 3		;(B4)	shift blue bits to bits 0-4
    31	movq		mm5, mm4	;(B5)	save the original data again
    32	pand		mm4, GREENS	;(B6)	mask out all but the 5 MSB 
    					;	green bits
    33	movq		mm7, mm2	;(D5)	save the original data again
    34	pand		mm2, GREENS	;(D6)	mask out all but the 5 MSB 
    					;	green bits
    35	psrld		mm4, 6		;(B7)	shift green bits to bits 5-9
    36	por		mm3, mm4	;(B8)	or in the green bits with the blue
    37	psrld		mm2, 6		;(D7)	shift green bits to bits 5-9
    38	pand		mm5, REDS	;(B9)	mask out all but the 5 MSB red bits
    39	por		mm1, mm2	;(D8)	or in the green bits with the blue
    40	pand		mm7, REDS	;(D9)	mask out all but the 5 MSB red bits
    41	psrld		mm5, 9		;(B10)	shift red bits to bits 10-14
    42	por		mm3, mm5	;(B11)	or in the red bits
    43	psrld		mm7, 9		;(D10)	shift red bits to bits 10-14
    44	packssdw	mm0, mm3	;(AB12) pack result 1 and 2 into one qword
    45	por		mm7, mm1	;(D11)	or in the red bits
    46	packssdw	mm6, mm7	;(CD12) pack result 3 and 4 into one qword
    

    All that remains is to add instructions to write the results out to memory and to loop through the source. See Appendix A for the full source code for the Mask-Shift-Or algorithm. The instructions to calculate the value of the loop counter were written for this application and may need to be changed to meet the needs of the user's application. Also note that the source data is processed from the end of the source array instead of the beginning in order to eliminate the need for a compare instruction before the jump to the beginning of the loop.

    Multiply-Add Algorithm

    The Multiply-Add algorithm is less straightforward but takes advantage of the PMADD instruction to shift and combine the bits into their final position. It also uses the 64-bit MMX registers and instructions to convert two 24-bit true color elements to 16-bit high color at the same time.

    The PMADD instruction is capable of multiplying each word of a quaDWORD by a unique word value. It then adds the two low-order words and puts the result in the lower doubleword of the result register. Likewise, the two high-order words are added with the result written into the high order double word of the result register. This instruction only supports word multiplication's with the result of the PMADD instruction written as double words.

    The first trap to avoid is the assumption that it is necessary to unpack the three bytes of color information into three words in order to use the PMADD instruction. By initially ignoring the green byte, the red and blue bytes can be operated on as words. By carefully choosing the multiplication factor used for each of these colors, the red and blue bytes can be shifted into their relative final position and simply OR-ORed with the green bits.

    The flow of the basic algorithm is depicted in Figure 2.

    Figure 2. PMADD Basic Algorithm

    Example 4. PMADD Basic Algorithm

    
    1	movq		mm0, [eax]		;get two 24-bit pixels
    2	movq		mm1, mm0		;save the original data
    3	pand		mm0, REDBLUE		;mask out all but the 5MSBits of red and blue
    4	pmaddwd		mm0, MULFACT		;multiply each word by
    						;2**13, 2**3, 2**13, 2**3 and add results
    5	pand		mm1, GREEN		;mask out all but the 5MSBits of green
    6	por		mm0, mm1		;combine the red, green, and blue bits
    7	psrld		mm0, 6			;shift to final position
    

    Example 4 shows the implementation of the basic algorithm using MMX instructions. As in the Mask-Shift-Or method, processing only one pair of pixels does little to take advantage of the processor's ability to pair instructions. In fact, none of the instructions in the basic algorithm pair. This algorithm is very efficient, however, in its use of registers allowing for more pixels to be processed within the same loop.

    By processing another six pixels and pairing the instructions to make the best use of the U and V pipes, eight pixels can be processed in nineteen cycles or 2.375 cycles per pixel. The comments indicate which pair of pixels is being affected and the instruction step as illustrated in Example 4. Unlike the PMADD instructions in lines 7 and 8, the PMADD instruction in lines 21 and 24 cause pipeline stalls because the results need to be used before the three clock cycles that the multiplier needs have been completed. Also, because mm5 is used as the destination in the last three instructions, these instructions do not pair.

    Example 5. PMADD Paired Algorithm

    
    ;assume mm6 holds mask for 5MSbits of green
    ;assume mm7 holds the multiplication factor 213,23,213,23
    
    1	movq		mm2, 8[eax*4]		;(B1)	get second two 24-bit pixels
    2	movq		mm0, [eax]		;(A1)	get first two 24-bit pixels
    3	movq		mm3, mm2		;(B2)	save the original data
    4	pand		mm3, REDBLUE		;(B3)	mask for 5MSbits of red and blue
    5	movq		mm1, mm0		;(A2)	save the original data
    6	pand		mm1, REDBLUE		;(A3)	mask for 5Msbits of red and blue
    7	pmaddwd		mm3, mm7		;(B4)	use multiply to shift and or
    						;	can't use mm3 for 3 cycles!
    8	pmaddwd		mm1, mm7		;(A4)	but we can do another multiply
    9	pand		mm2, mm6		;(B5)	mask for 5MSbits of green
    10	movq		mm4, 24[eax]		;(D1)	get fourth two 24-bit pixels
    11	pand		mm0, mm6		;(A5)	mask for 5MSbits of green
    12	movq		mm5, 16[eax]		;(C1)	get third two 24-bit pixels
    13	por		mm3, mm2		;(B6)	combine the red, green, and blue
    14	psrld		mm3, 6			;(B7)	shift to final position
    15	por		mm1, mm0		;(A6)	combine the red, green, and blue
    16	movq		mm0, mm4		;(D2)	save the original data
    17	psrld		mm1, 6			;(A7)	shift to final position
    18	pand		mm0, REDBLUE		;(D3)	mask for 5MSbits of red and blue
    19	packssdw	mm1, mm3		;(AB8)	pack result into one qword
    20	movq		mm3, mm5		;(C2)	save the original data
    21	pmaddwd		mm0, mm7		;(D4)	use multiply to shift and or
    						;	can't use mm0 for 3 cycles!
    22	pand		mm3, REDBLUE		;(C3)	mask for 5MSbits of red and blue
    23	pand		mm4, mm6		;(D5)	mask for 5MSbits of green
    24	pmaddwd		mm3, mm7		;(C4)	use multiply to shift and or
    25	por		mm4, mm0		;(D6)	combine the red, green, and blue
    						;	pipeline stall waiting for mm0
    						;	because pmadd needs 3 cycles
    26	pand		mm5, mm6		;(C5)	mask for 5MSbits of green
    27	psrld		mm4, 6			;(D7)	shift to final position
    28	por		mm5, mm3		;(C6)	combine the red, green, and blue
    						;	pipeline stall waiting for mm3
    						;	because pmadd needs 3 cycles!
    29	psrld		mm5, 6			;(C7)	shift to final position
    30	packssdw	mm5, mm4		;(C8)	pack result into one qword
    

    In order to achieve better pairing and make use of the three clock cycles after a PMADD instruction, the code can be adjusted to loop through the source code starting in the middle of the code shown in Example 5. By doing this, the instructions to read and begin processing the first two pairs of pixels can be done while waiting for the PMADD results in lines 21 and 24 to be usable and paired with lines 28 through 30. The loop control can also be added for "free" reducing the processing time to 17 cycles for eight pixels or 2.125 cycles per pixel. All of the instructions within the loop now pair. Of course, additional instructions are needed before the start of the loop in order correctly begin the loop. For large sets of data, the increase in efficiency quickly makes up for the extra cycles taken outside of the loop.

    See Appendix B for the full source code for the PMADD algorithm. The instructions to calculate the value of the loop counter were written for this application and may need to be changed to meet the needs the user's application. Again, note that the source data is processed from the end of the source array instead of the beginning in order to eliminate the need for a compare instruction before the jump to the beginning of the loop.

    APPENDIX A: Mask-Shift-Or Algorithm

    ;*
    ;* Description:
    ;*	The purpose of this file is to provide the MMX code for the 
    ;*	RGB24to16 algorithm as an instructional example to those who
    ;*	are just beginning to code using MMX instructions.
    ;*
    ;* Assumptions:
    ;*	1.	The number of elements allocated for the source (src) must be 
    ;*		divisible by 8. The number of rows X the number of columns does 
    ;*		not need to be divisible by 8.  This is to allow working on 8 
    ;*		pixels within the inner loop without having to post-process pixels 
    ;*		after the loop.
    ;*	2.	The number of elements allocated for the destination (dest) must 
    ;*       	be divisible by 8.  The number of rows X the number of columns 
    ;*       	does not need to be divisible by 8.  This is to allow working on 
    ;*       	8 pixels within the inner loop without having to post-process 
    ;*       	pixels after the loop.
    ;*
    ;*************************************************************************/
    ;
    .586
    .MODEL FLAT, C
    PD EQU <DWORD PTR>
    PW EQU <WORD PTR>
    PB EQU <BYTE PTR>
    ;----------------------------- --------------------------------------------
    .CODE
    RGB24to16 PROC C PUBLIC USES esi edi ebp ebx ecx, src:PTR DWORD,
                                                      dest:PTR DWORD,
                                                      nRows:PTR DWORD,
                                                      nCols:PTR DWORD
    ; Locals (on local stack frame)
    saveesp			EQU PD [esp+0]
    EndOfLocals            	EQU PD [esp+4]
    LocalFrameSize = 4
    ;constants for MMX register initialization
    .DATA
    blues   dq 0f8000000f8h             ;mask for 5 MSbits of blue data
    greens  dq 0f8000000f800h           ;mask for 5 MSbits of green data
    reds    dq 0f8000000f80000h         ;mask for 5 MSbits of red data
    .CODE
    MEM_MASK_BLUES EQU DWORD PTR blues
    MEM_MASK_GREENS EQU DWORD PTR greens
    MEM_MASK_REDS EQU DWORD PTR reds
    ;**************************************************************************
    ;*
    ;* Procedure: rgb24to16					Date: 12/08/95
    ;*
    ;* Author: Patricia L. Gray				File: rgb24to16.asm
    ;*
    ;* Description:
    ;*	rgb24to16 is an optimized MMX routine to convert RGB data from 
    ;*	24 bit true color to 16 bit high color.  The inner loop processes 
    ;*	8 pixels at a time and packs the 8 pixels represented as 8 DWORDs
    ;*	into 8 WORDs. The algorithm used for each 2 pixels is as follows:
    ;*	Step 1:		read in 2 pixels as a quad word
    ;*	Step 2:		make a copy of the two pixels
    ;*	Step 3:		AND the 2 pixels with 0x00f8000000f8 to obtain a 
    ;*				quad word of:
    ;*	000000000000000000000000BBBBB000 000000000000000000000000bbbbb000
    ;*
    ;*	Step 4:		PSRLD quad word by 3 to obtain a quad word of:
    ;*	000000000000000000000000000BBBBB 000000000000000000000000000bbbbb
    ;*
    ;*	Step 5:		make a copy of the original again
    ;*	Step 6:		AND the copy of the original pixels with 
    ;*				0x0000f8000000f800 to obtain
    ;*	0000000000000000GGGGG00000000000 0000000000000000ggggg00000000000
    ;*
    ;*	Step 7		PSRLD quad word by 6 to obtain a quad word of:
    ;*	0000000000000000000000GGGGG00000 0000000000000000000000ggggg00000
    ;*
    ;*	Step 8		OR the results of Step 4 and 7 to obtain a quad
    ;*				word of:
    ;*	0000000000000000000000GGGGGBBBBB 0000000000000000000000gggggbbbbb
    ;*
    ;*	Step 9		AND the last copy of the original with
    ;*				0x00f8000000f80000h to obtain
    ;*	00000000RRRRR0000000000000000000 00000000rrrrr0000000000000000000
    ;*
    ;*	Step 10		PSRLD quad word by 9 to obtain a quad word of:
    ;*	00000000000000000RRRRR0000000000 00000000000000000rrrrr0000000000
    ;*
    ;*	Step 11		OR the results of Step 8 and 10 to obtain a quad
    ;*				word of:
    ;*	00000000000000000RRRRRGGGGGBBBBB 00000000000000000rrrrrgggggbbbbb
    ;*
    ;*	Step 8:		When two pairs of pixels are converted, pack the 
    ;*				results into one register and then store them into
    ;*				the destination.
    ;*
    ;* Inputs:
    ;*	src		long int *	a pointer to the first element
    ;*					of the input source
    ;*	dest		short int *	a pointer to the destination 
    ;*					of the converted RGB data
    ;*	nRows		short int *	the number of rows in the src/dest bitmap
    ;*	nCols		short int *	the number of columns in the src/dest bitmap
    ;*
      mov	ecx,esp
      sub esp,LocalFrameSize
      and	esp, 0fffffff8h		;8-byte align start of local stack frame
      mov	saveesp, ecx		;save original esp to restore in epilgue
      mov   eax, nRows         	;multiply nRows
      mov   ebx, nCols           	;with nCols
      imul  ebx                 	;to get the size of the source
      mov   ecx, eax            	;compare with loop counter
      sub   ecx, 8
      
      mov	eax, src			;load the src and	 
      mov	ebx, dest             	;and dest pointers
    inner_loop:
      movq  mm0, [eax+4*ecx]    	;get the first 2 RGB elements
      
      movq  mm3, [eax+4*ecx+16] 	;get the third 2 RGB elements
      movq  mm1, mm0           	;save original first 2
      
      pand  mm0, MEM_MASK_BLUES 	;mask out all but the 5 MSB blue data
      movq  mm4, mm3           	;save the original third 2
      
      pand  mm3, MEM_MASK_BLUES 	;mask out all but the 5 MSB blue data
      psrld mm0, 3           	;shift blues right 3 which is now the result reg 1
      
      psrld mm3, 3             	;shift blues right 3 which is now the result reg 3
      movq  mm2, mm1           	;save the original again
      
      pand  mm1, MEM_MASK_GREENS	;mask out all but the 5MSB green data
      movq  mm6, mm4        	;save the original again
      
      pand  mm4, MEM_MASK_GREENS 	;mask out all but the 5MSB green data
      psrld mm1, 6            	;shift greens to bits 5-9
      
      por   mm0, mm1            	;combine with the blue data in result reg 1
      psrld mm4, 6             	;shift greens to bits 5-9
      
      pand  mm2, MEM_MASK_REDS 	;mask out all but the 5MSB red data
      por   mm3, mm4         	;combine with the green data in result reg 2
      
      pand  mm6, MEM_MASK_REDS 	;mask out all but the 5MSB red data
      psrld mm2, 9              	;shift reds to bits 10-14
      
      por   mm0, mm2            	;combine with the blue and green data in result reg
      psrld mm6, 9            	;shift reds to bits 10-14
      
      movq  mm1, [eax+4*ecx+24]	;get fourth 2 RGB elements
      por   mm6, mm3        	;combine with the blue and green data in result reg
      
      movq  mm3, [eax+4*ecx+8]	;get second 2 RGB elements
      movq  mm2, mm1            	;save the original fourth 2
      pand  mm1, MEM_MASK_BLUES 	;mask out all but the 5 MSB blue data
      movq  mm4, mm3           	;save original second 2
      pand  mm3, MEM_MASK_BLUES	;mask out all but the 5 MSB blue data
      psrld mm1, 3           	;shift blues right 3 which is now the result reg 4
      
      psrld mm3, 3            	;shift blues right 3 which is now the result reg 2
      movq  mm5, mm4       		;save the original again
      pand  mm4, MEM_MASK_GREENS	;mask out all but the 5 MSB green data
      movq  mm7, mm2         	;save the original again
      pand  mm2, MEM_MASK_GREENS	;mask out all but the 5 MSB green data
      psrld mm4, 6			;shift greens to bits 5-9
      por   mm3, mm4          	;combine with the blue data in result reg 2
      psrld mm2, 6            	;shift greens to bits 5-9
      pand  mm5, MEM_MASK_REDS 	;mask out all but the 5 MSB red data
      por   mm1, mm2          	;combine with the blue data in result reg 4
      pand  mm7, MEM_MASK_REDS  	;mask out all but the 5 MSB red data
      psrld mm5, 9            	;shift reds to bits 10-14
      por   mm3, mm5         	;combine with the blue and green data in result reg 2
      psrld mm7, 9            	;shift reds to bits 10-14
      packssdw  mm0, mm3      	;pack result 1 and 2 into one qword
      por   mm7, mm1          	;combine with the blue and green data in result reg 4
      packssdw  mm6, mm7      	;pack result 3 and 4 into one qword
      
      movq  [ebx+2*ecx], mm0  	;store the result
      
      movq  [ebx+2*ecx+8], mm6  	;store the result
      
      sub   ecx, 8
      jae   inner_loop       	;go get some more if not done
      
    DONE:
      emms				; function epilogue
      mov	esp, saveesp
      ret
    RGB24to16 ENDP
    END
    

    APPENDIX B: PMADD Algorithm

    
    ;* Description:
    ;*	The purpose of this file is to provide the MMX code for the 
    ;*	RGB24to16 algorithm as an instructional example to those who
    ;*	are just beginning to code using MMX instructions.
    ;*
    ;* Assumptions:
    ;*	1.	The number of elements allocated for the pMatrix must be 
    ;*		divisible by 8. The number of rows X the number of columns does 
    ;*		not need to be divisible by 8.  This is to allow working on 8 
    ;*		pixels within the inner loop without having to post-process pixels 
    ;*		after the loop.
    ;*	2.	The number of elements allocated for qMatrix must be divisible 
    ;*		by 8.  The number of rows X the number of columns does not need 
    ;*		to be divisible by 8.  This is to allow working on 8 pixels within 
    ;*		the inner loop without having to post-process pixels after the loop.
    ;*
    ;****************************************************************************/
    	TITLE	Convert RGB 24 to 16
    	.486P
    .model FLAT
    ;****************************************************************************
    ;* 				DATA SEGMENT
    ;****************************************************************************
    _DATA	SEGMENT
    rgbMulFactor	    	dq  2000000820000008H  	; RGB quad word multiplier
    rgbMask1    	    	dq  00f800f800f800f8H
    rgbMask2	    		dq  0000f8000000f800H
    _DATA ENDS
    ;****************************************************************************
    ;* 				TEXT SEGMENT
    ;****************************************************************************
    _TEXT	SEGMENT 
    ;
    ; Declare rgb24to16 as a public routine to allow the 'C' code to
    ; call it.
    ;
    PUBLIC RGB24to16
    ;* Description:
    ;*	rgb24to16 is an optimized MMX routine to convert RGB data from 
    ;*	24 bit true color to 16 bit high color.  The inner loop processes 8 
    ;*	pixels at a time and packs the 8 pixels represented as 8 DWORDs 
    ;*	into 8 WORDs. The algorithm used for each 2 pixels is as follows:
    ;*	Step 1:		read in 2 pixels as a quad word
    ;*	Step 2:		make a copy of the two pixels
    ;*	Step 3:		AND the 2 pixels with 0x00f800f800f800f8 to obtain a 
    ;*				quad word of:
    ;*	00000000RRRRR00000000000BBBBB000 00000000rrrrr00000000000bbbbb000
    ;*
    ;*	Step 4:		PMADDWD this quad word by 0x2000000820000008 to obtain
    ;*				a quad word of:
    ;*	00000000000RRRRR00000BBBBB000000 00000000000rrrrr00000bbbbb000000
    ;*
    ;*	Step 5:		AND the copy of the original pixels with 
    ;*				0x0000f8000000f800 to obtain
    ;*	0000000000000000GGGGG00000000000 0000000000000000ggggg00000000000
    ;*
    ;*	Step 6:		OR the results of step 4 and step 5 to obtain
    ;*	00000000000RRRRRGGGGGBBBBB000000 00000000000rrrrrgggggbbbbb000000
    ;*
    ;*	Step 7:		SHIFT RIGHT by 6 bits to obtain
    ;*	00000000000000000RRRRRGGGGGBBBBB 00000000000000000rrrrrgggggbbbbb
    ;*
    ;*	Step 8:		When two pairs of pixels are converted, pack the 
    ;*				results into one register and then store them into
    ;*				the q Matrix.
    ;*
    ;* Inputs:
    ;*	pPtr		long int  *	a pointer to the first element
    ;*					of the input 'p' matrix
    ;*	qPtr		short int *	a pointer to the output 
    ;*					RGB converted matrix
    ;*	nRows		short int *	the number of rows in the p/q matrix
    ;*	nCols		short int *	the number of columns in the p/q matrix
    ;*
    ;****************************************************************************/
    RGB24to16 PROC C USES ebx ecx edx,
    pMatrix:PTR DWORD, 
    qMatrix:PTR WORD,
    nRows:DWORD, 
    nCols:DWORD
    ;
    ; This calculates the number of elements in the 'p' matrix and assigns it to
    ; EAX.  EAX is then adjusted to contain the index to the last 8 pixel aligned
    ; pixel by performing ((nRows * nCols) - 8 + 7) & 0xfffffff8. Pointers to the 
    ; arrays 'p' and 'q' are also set up
    ;
    	mov	eax, nRows
    	mov	ebx, nCols
    	
    	imul	ebx				; EAX = total number of
    						; pixels to process
    	mov	ebx, pMatrix		; EBX points to 'pMatrix'
    	sub	eax, 1			; align index EAX to the 
    						; last 8 pixel boundary 
    	mov	edx, qMatrix		; EDX points to 'qMatrix'
    	and	eax, 0fffffff8H		; finish the index EAX
    						; alignment
    ;
    ; This section performs up to and including step 4 on pixels 0 and 1.  It also
    ; performs up to and including step 5 on pixels 2 and 3.  This is done prior to
    ; entering the loop so that better loop efficiency is achieved.  Better loop 
    ; efficiency is achieved because these instructions are paired with other 
    ; instruction at the end of the loop which could not be previously paired.
    ;
    	movq	mm7, DWORD PTR rgbMulFactor	; MM7 = pixel multiplication
    							;factor
    	movq	mm6, DWORD PTR rgbMask2		; MM6 = green pixel mask
    	movq	mm2, 8[ebx][eax*4]		; get pixels 2 and 3
    	movq	mm0, [ebx][eax*4]			; get pixels 0 and 1
    	movq	mm3, mm2				; copy pixels 2 and 3
    	pand	mm3, DWORD PTR rgbMask1		; get R and B of pixels 2 and 3
    	movq	mm1, mm0				; copy pixels 0 and 1
    	pand	mm1, DWORD PTR rgbMask1		; get R and B of pixels 0 and 1
    	pmaddwd mm3, mm7				; SHIFT-OR pixels 2 and 3
    	pmaddwd mm1, mm7				; SHIFT-OR pixels 0 and 1
    	pand	mm2, mm6				; get G of pixels 2 and 3
    ;
    ; This section performs steps 1 through 8 for 4 pairs of pixels (or for a total
    ; of 8 pixels).
    ;
    inner_loop:
    	movq	mm4, 24[ebx][eax*4]	; get pixels 6 and 7
    	pand	mm0, mm6			; get G of pixels 0 and 1
    	movq	mm5, 16[ebx][eax*4]	; get pixels 4 and 5
    	por	mm3, mm2			; OR to get RGB of pixels 2
    						; and 3
    	psrld	mm3, 6			; SHIFT pixels 2 and 3 (step 7)
    	por	mm1, mm0			; OR to get RGB of pixels 0
    						; and 1
    	
    	movq	mm0, mm4			; copy pixels 6 and 7
    	psrld	mm1, 6			; SHIFT pixels 0 and 1 (step 7)
    	pand	mm0, DWORD PTR rgbMask1	; get R and B of pixels 6 and 7
    	packssdw mm1, mm3			; combine pixels 0, 1, 2, and 3
    	movq  mm3, mm5			; copy pixels 4 and 5
    	pmaddwd mm0, mm7			; SHIFT-OR pixels 6 and 7
    	pand  mm3, DWORD PTR rgbMask1	; get R and B of pixels 4 and 5
    	pand  mm4, mm6			; get G of pixels 6 and 7
    							   
    	movq	[edx][eax*2], mm1		; store pixels 0, 1, 2, and 3
    	pmaddwd mm3, mm7			; SHIFT-OR pixels 4 and 5
    	sub	eax, 8			; subtract 8 pixels from the index
    	por	mm4, mm0			; OR to get RGB of pixels 6 and 7
    	
    	pand	mm5, mm6			; get G of pixels 4 and 5
    	psrld	mm4, 6			; loop iteration; SHIFT pixels 6 
    	movq	mm2, 8[ebx][eax*4]	; get pixels 2 and 3 for the next
    	por	mm5, mm3			; loop iteration ; OR to get RGB of
    						; and 7 (step 7)
    	movq	mm0, [ebx][eax*4]		; get pixels 0 and 1 for the next 
    	psrld	mm5, 6			; SHIFT pixels 4 and 5 (step 7)
    						; pixels 4 and 5
    	movq	mm3, mm2			; copy pixels 2 and 3
    	movq	mm1, mm0			; copy pixels 0 and 1
    	pand	mm3, DWORD PTR rgbMask1	; get R and B of pixels 2 and 3
    	packssdw mm5, mm4			; combine pixels 4, 5, 6 and 7
    	
    	pand	mm1, DWORD PTR rgbMask1	; get R and B of pixels 0 and 1
    	pand	mm2, mm6			; get G of pixels 2 and 3
    	movq	24[edx][eax*2], mm5	; store pixels 4, 5, 6 and 7
    	pmaddwd mm3, mm7			; SHIFT-OR pixels 2 and 3
    	
    	pmaddwd mm1, mm7			; SHIFT-OR pixels 0 and 1
    	jge	inner_loop			; if we need to do more jump to the
    						; beginning of the loop
    ;
    ; We have converted 24-bit true color to 16-bit high color for the given data!
    ;
    rgb24to16_done:
    	emms
    	ret	0				; we are done!
    RGB24to16 ENDP
    _TEXT	ENDS
    END
    

    * Legal Information © 1998 Intel Corporation